Precomputing grammar masks: turn a GLR stack into an LLM token mask

I want to share a take on grammar‑constrained generation I’ve been working on for a while. The punchline is:

You can compute “which LLM tokens are valid next?” by reading the parser stack through a precomputed weighted automaton—instead of simulating the parser for every candidate token.

At runtime, mask generation becomes a tight loop of:

  table lookup → bitset AND/OR → table lookup → …

…and not “run the lexer+parser on a huge trie of token bytes every step.”

This post focuses on the runtime idea and the core compilation decomposition (token→terminal mapping, per‑terminal stack behavior, composition). The gnarly “how to actually build these automata without blowing up determinization/memory” details are real, and I’ll likely write those up separately.


The problem

You have a grammar (often “JSON Schema”, but let’s just say some CFG-ish constraint) and you want the LLM to produce syntactically valid output.

At each decoding step, the model produces logits over a fixed vocabulary V (often ~100k–200k tokens). You sample one token, append it, repeat.

Constrained decoding means: don’t sample tokens that would make the output impossible to complete into a valid string.

Formally, if w is the already‑generated token prefix, a token v ∈ V is valid if it can still be extended to a full grammar string:

v is valid given w  ⟺  ∃u. wvu ∈ L(G).

So the mask we want is:

GetMask(w) = { v ∈ V : ∃u. wvu ∈ L(G) }.

This “exists a completion” definition is the right one for decoding: you don’t just want “does the prefix parse so far?”; you want “did I dead‑end the grammar?”.
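To make the definition concrete, here is a toy brute‑force GetMask for a deliberately tiny setting: the “grammar” is just the regular language [0-9]+, and the four‑token vocabulary is invented for illustration.

```python
# Toy illustration of GetMask's "exists a completion" definition.
# The "grammar" here is just the regular language [0-9]+ and the
# vocabulary is four made-up byte-string tokens.

DIGITS = set("0123456789")

def alive(s: str) -> bool:
    """Can s still be extended to a string in [0-9]+ ?"""
    # Any all-digit prefix extends to a member of [0-9]+ (itself, if nonempty).
    return all(c in DIGITS for c in s)

def get_mask(w: str, vocab: list[str]) -> set[str]:
    """v is valid given w iff some completion u makes w+v+u a digit string."""
    return {v for v in vocab if alive(w + v)}

VOCAB = ["1", "12", "a", "2b"]
```

Note that "2b" is rejected even though it starts with a valid digit: once the "b" is appended, no completion can ever repair the string. That is exactly the dead‑end the "exists a completion" definition guards against.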


Two primitive operations

Think of the constraint system as two functions:

  - get_mask(state) → bitset over V: which LLM tokens are allowed next, and
  - commit(token) → state: advance the constraint state past a sampled token.

At runtime, decoding looks like:

repeat:
  (GPU) compute logits
  (CPU) compute mask
  apply mask to logits
  sample token
  commit token to both:
    - the model (KV cache / etc.)
    - the constraint state

You can overlap GPU and CPU work, but the mask is still on the critical path: if it’s late, the GPU waits.


Why worst‑case latency matters (not average)

Inference is batched. A single “slow” request can introduce bubbles or stall a batch scheduler.

So the question isn’t:

“Can we compute masks fast on average?”

It’s closer to:

“Can we compute masks under a fixed budget every single step?”

If you’re running high‑throughput decoding, your per‑token time budget can easily be in the sub‑millisecond to low‑millisecond range per stream. Even if your GPU forward pass is longer than 1ms, mask generation still needs to be stable: schedulers and batching strategies hate unpredictable tail latency.

Mask generation is also frequently branchy and cache‑unfriendly in naïve designs—exactly the kind of workload that looks fine at p50 and awful at p99.9.


Commit is incremental parsing (GLR + a graph‑structured stack)

The commit(token) side is “incremental parsing”: you feed the newly generated bytes to a lexer+parser and update the parse configuration.

The parsing model I’m assuming here is:

  - a streaming grammar lexer that turns raw bytes into grammar terminals, and
  - an LR-family parser generalized to GLR, so ambiguity is represented by a graph‑structured stack (GSS) rather than backtracking.

The important pieces:

  - the parse stack holds LR state IDs, not grammar symbols, and
  - several stacks can be live at once, sharing structure in the GSS.

So the constraint state is something like:

  (current lexer state(s), GSS whose nodes are LR state IDs)
If you’ve seen tree‑sitter, Lezer, or classic GLR literature, this is that world.

Glossary (because “token” is overloaded)

  - LLM token: an element of the model vocabulary V; a byte sequence chosen by the tokenizer.
  - Grammar terminal: a lexical unit of the grammar (NUMBER, STRING, “{”, …) produced by the grammar lexer.


The real problem: generating masks over LLM tokens

Given the current constraint state, we want a mask over LLM tokens.

A naïve statement is:

“A token is allowed if appending it keeps the parse valid.”

But the parser doesn’t consume LLM tokens. It consumes grammar terminals.

And the mapping is messy:

  - one LLM token can contain several grammar terminals,
  - a terminal can be split across several LLM tokens,
  - an LLM token can end in the middle of a terminal, and
  - what a byte sequence lexes to depends on the current lexer state.

So “check token t” actually means:

  1. feed its bytes into the grammar lexer (starting from the current lexer state)
  2. possibly emit one or more terminals (or partials)
  3. for each emitted terminal, update the parser (shift/reduce/goto), possibly branching GLR stacks
  4. ensure we did not dead‑end and that a completion still exists

Doing that for a single token is manageable.

Doing that for 200k tokens every step is not.


Why the obvious approaches hurt at tail latency

A common direction is:

  1. Build a trie/trellis of all vocabulary token byte strings.
  2. Traverse it while simulating the lexer+parser, pruning invalid branches.
  3. Collect which vocabulary tokens survive.

This avoids a literal “for each token” loop and shares prefixes.

It can work decently for many grammars.

But worst‑case behavior is still brutal, for a few reasons:

1) Long tokens force deep traversal

Modern vocabularies contain weird long tokens: repeated punctuation, long runs of spaces, long runs of digits, etc. (If you’ve inspected cl100k_base/o200k_base, you’ve seen these.)

If a token is, say, 80–120 bytes long and those bytes are syntactically valid in the current context, then a trie traversal must still walk deep to confirm it.

This is already annoying even for deterministic parsing.

2) Reductions can chain, and GLR bookkeeping is pointer‑heavy

In LR parsing, a single terminal can trigger:

  - zero or more reductions (each popping stack frames and pushing a GOTO state), followed by
  - one shift.

Those reductions pop stack frames and consult GOTOs, and in GLR they can create fan‑out.

Even if average behavior is fine, the tail can be dominated by:

  - long reduction chains before a shift,
  - GSS fan-out across ambiguous branches, and
  - pointer-heavy stack bookkeeping.

If your mask algorithm is “simulate parsing along many trie edges”, you’re multiplying this work.

3) The inner loop is the wrong shape

In p99.9 land you want inner loops that look like:

  array/table lookups and bitset AND/ORs: no pointer chasing, no data-dependent branching.

Trie traversal + GLR is the opposite: pointer chasing, branching, irregular memory access.

I tried hard to engineer this style of approach into something stable. In my experience it’s a dead end if you care about strict, predictable latency.


Invert the question: stack → tokens (not tokens → stack)

Token validity depends on the current parse configuration—especially what’s on the LR stack.

So instead of asking:

“For each token t, does it work on this stack?”

ask:

“Given this stack, which tokens work?”

That suggests a precomputed function:

f(stack, lexer state) → bitset over V

…which we can evaluate fast at runtime.

The key enabling fact from LR theory is:

The set of viable prefixes (strings that can appear on the parser stack during a valid parse) forms a regular language (Knuth).

Informally: although the grammar language is context‑free, the space of possible stacks is regular, recognized by a finite automaton (the LR(0) automaton).

So there exists a finite automaton that can read (a suffix of) the stack and decide things about future validity.

We’ll use that to compute a mask, not just accept/reject.


Weighted automata over parser state IDs

Here’s the core data structure:

  - a deterministic automaton whose input alphabet is LR parser state IDs (read off the stack, top to bottom), and
  - whose transitions each carry a weight: a bitset over the LLM vocabulary V.

Think “finite automaton with bitset weights.”

The semiring intuition

We use bitsets 2^V with:

  - intersection (bitwise AND) to combine constraints along one path, and
  - union (bitwise OR) to merge alternatives.

This fits the intuition:

  - every symbol read from one stack can only narrow the set of valid tokens, and
  - a token valid on any surviving stack is valid overall.
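A minimal sketch of these two operations with Python ints as bitsets (the vocabulary and the per-path weights are made up):

```python
# Bitsets over a toy vocabulary, represented as Python ints.
# AND (&) combines constraints along one stack path;
# OR (|) merges alternative parse stacks.

VOCAB = ["{", "}", '"a"', ":", "123"]

def bitset(tokens):
    """Build an int bitset with one bit per vocabulary token."""
    return sum(1 << VOCAB.index(t) for t in tokens)

# Constraints picked up while reading one stack path, top to bottom:
path_weights = [bitset(["{", '"a"', "123"]), bitset(['"a"', "123", ":"])]

mask_path = (1 << len(VOCAB)) - 1   # start from "all tokens"
for w in path_weights:
    mask_path &= w                  # each constraint can only narrow the set

# A second stack path (another GLR branch) that only allows "}":
mask_other = bitset(["}"])

# Union across branches: a token valid on any surviving stack is valid.
mask = mask_path | mask_other
```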

What the automaton does at runtime

At mask time, we:

  1. Start in an initial automaton state (depends on the current lexer/tokenizer state).
  2. Consume stack state IDs from top to bottom.
  3. Intersect bitset weights as we go.
  4. Stop early once further stack depth cannot change the result (bounded lookback).
  5. Output the resulting vocabulary bitset.

This is the “read‑the‑stack” idea.


Runtime algorithm (single stack)

If the automaton is deterministic, the conceptual runtime is almost embarrassingly simple:

def get_mask_single_stack(dwa, lexer_state, stack_top_to_bottom):
    q = dwa.start_state_for_lexer(lexer_state)
    mask = ALL_TOKENS_BITSET

    for sid in stack_top_to_bottom:
        q, w = dwa.step(q, sid)     # table lookup
        mask &= w                  # bitset AND

        if dwa.can_stop_here(q, mask):
            break

    mask &= dwa.final_weight(q)
    return mask

In practice you may still track a small frontier (e.g., multiple lexer states, or running over a GSS), but the inner operations stay “lookup + AND/OR”.
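To show that this loop really is just lookups and ANDs, here is a runnable toy instantiation where the DWA is a pair of hand-written dicts; all states, stack IDs, and weights are invented for illustration:

```python
# Toy instantiation of the single-stack mask loop: the "DWA" is two dicts.

ALL = 0b1111  # 4-token toy vocabulary

TRANS = {            # (automaton state, stack state id) -> next automaton state
    ("q0", 7): "q1",
    ("q1", 3): "q2",
}
WEIGHT = {           # same key -> bitset of tokens still allowed
    ("q0", 7): 0b0111,
    ("q1", 3): 0b0110,
}
FINAL = {"q0": ALL, "q1": ALL, "q2": 0b1110}

def get_mask(stack_top_to_bottom):
    q, mask = "q0", ALL
    for sid in stack_top_to_bottom:
        key = (q, sid)
        q, mask = TRANS[key], mask & WEIGHT[key]   # lookup + AND
        if mask == 0:      # nothing can survive further filtering: stop early
            break
    return mask & FINAL[q]
```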


Runtime algorithm (GLR / GSS)

With GLR you don’t have one stack; you have a DAG of stacks (the GSS). Each path from a top node down to the root corresponds to one possible LR stack.

For get_mask, you want:

a token is valid if it is valid on any surviving parse stack.

So we run the same propagation, but over a graph:

  - intersect weights along each edge as you walk from a top node toward the root, and
  - union the partial results wherever paths merge.

The shape is the same: ∩ along edges, ∪ at merges.

You can implement it as a dynamic program / worklist over (gss_node, automaton_state) pairs. The key is: it’s still just table lookups and bitset ops; no “simulate parser transitions for every vocab token”.
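A minimal sketch of that worklist, with the GSS and the automaton both given as toy dicts (every node, state ID, and weight here is invented):

```python
# Sketch: mask propagation over a GSS, intersecting along edges and
# unioning where alternative stacks reach the root.
from collections import deque

ALL = 0b1111

# GSS: each node holds an LR state id; edges point toward the root.
STATE_ID = {"n0": 0, "n1": 7, "n2": 5, "n3": 3}
CHILDREN = {"n1": ["n3"], "n2": ["n3"], "n3": ["n0"], "n0": []}
TOPS = ["n1", "n2"]          # two live GLR stacks sharing their lower part

TRANS = {("q0", 7): "q1", ("q0", 5): "q1", ("q1", 3): "q2", ("q2", 0): "q2"}
WEIGHT = {("q0", 7): 0b0011, ("q0", 5): 0b0110, ("q1", 3): 0b0111, ("q2", 0): ALL}
FINAL = {"q2": ALL}

def get_mask_gss():
    mask = 0
    work = deque((top, "q0", ALL) for top in TOPS)
    while work:
        node, q, acc = work.popleft()
        key = (q, STATE_ID[node])
        if key not in TRANS:
            continue                              # this path dead-ends
        q, acc = TRANS[key], acc & WEIGHT[key]    # intersect along the edge
        if not CHILDREN[node]:                    # reached the root of the GSS
            mask |= acc & FINAL.get(q, 0)         # union across alternative stacks
        for child in CHILDREN[node]:
            work.append((child, q, acc))
    return mask
```

A real implementation would also memoize per (gss_node, automaton_state) pair so shared lower stack segments are not re-walked once per top; this sketch skips that for brevity.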


Early stopping: don’t read the whole stack

Two different (but compatible) reasons we can stop early:

  1. Theory / structure: For well‑behaved grammars, deciding what can happen next depends only on a bounded suffix of the stack. (More below.)
  2. Engineering: Even if the stack is deep, the mask often becomes “determined” after a few steps. Once the automaton reaches a region where deeper symbols can’t further filter the surviving token set, you stop.

A useful way to engineer this is to make the automaton explicitly track “already accepted tokens” versus “still undecided tokens,” so you can stop when the undecided set becomes empty. Conceptually:

  mask state = (accepted, undecided); stop as soon as undecided = ∅.

This is one of those optimizations that’s obvious in hindsight: you want to avoid doing work on bits whose fate is already fixed.
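One way to sketch the pair-tracking idea: carry (accepted, undecided) instead of a single mask. The per-step weights (decided_here, allowed) are hypothetical automaton outputs meaning “tokens whose validity is now certain” and “tokens still surviving”:

```python
# Sketch: carry (accepted, undecided) so the walk can stop the moment
# no token's fate still depends on deeper stack symbols.

def get_mask_early_stop(weights_top_to_bottom, all_tokens):
    accepted = 0                 # bits whose validity is already certain
    undecided = all_tokens       # bits deeper stack symbols could still veto
    steps = 0
    for decided_here, allowed in weights_top_to_bottom:
        steps += 1
        accepted |= undecided & decided_here   # these bits are now fixed
        undecided &= allowed & ~decided_here   # the rest must keep surviving
        if undecided == 0:
            break                # deeper symbols can't change anything
    # bits that survived all the way down are valid too
    return accepted | undecided, steps
```

In the test below, the third weight pair is never read: everything is decided after two steps, which is exactly the early stop.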


Where does the automaton come from?

So far I’ve described the runtime object as if it just exists.

The compilation story has three pieces:

  1. Terminal DWA: maps LLM tokens → sequences of grammar terminals (plus lexer state evolution).
  2. Template automata: for each grammar terminal, describe how it can transform the LR stack (reductions + shift).
  3. Composition: splice those together into a single automaton that reads stack state IDs and outputs a token mask.

I’ll sketch each. This is the minimum detail needed to make the idea “real” without diving into all the compile-time engineering.


1) Token → terminal mapping (Terminal DWA)

A grammar is defined over terminals. A language model produces LLM tokens.

These symbol sets don’t line up.

So we precompute, for each lexer/tokenizer state, which vocabulary tokens can produce which terminal sequences.

What makes this hard?

  - token boundaries don’t align with terminal boundaries,
  - one token can emit several terminals, or end mid-terminal, and
  - the answer depends on which lexer state you start in.

The Terminal DWA idea

Build a trie over all vocabulary token byte sequences. Then simulate the grammar lexer over the trie.

Whenever the lexer recognizes a terminal on some trie path, record:

  - which terminal was emitted,
  - the lexer state you end up in, and
  - which vocabulary tokens pass through that trie node.

This gives you an automaton whose:

  - states are (trie position, lexer state) configurations,
  - edges emit grammar terminals, and
  - weights are bitsets of the vocabulary tokens realizing each path.

Then determinize it (subset construction) to merge equivalent lexer configurations.

The result is: given a starting lexer state, you can walk through “what terminals could this token emit?” while carrying along exactly which vocab tokens correspond to those terminals.

You can think of this as a reversed transducer: instead of mapping token → terminals, it maps terminal-sequence paths → sets of tokens that realize them.
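A toy sketch of the simulation, assuming a two-terminal lexer (NUM = [0-9]+, COMMA = ","). For brevity it walks each token separately; the real construction shares prefixes via the trie and then determinizes:

```python
# Toy "Terminal DWA" precomputation: simulate a tiny lexer over each
# vocabulary token's bytes, recording which terminals it fully emits
# and what lexer state is left over.

VOCAB = ["12", "1,", ",", "1"]

def lex_step(pending, ch):
    """Advance the toy lexer by one byte.
    Returns (terminals emitted now, new pending state) or None on error."""
    if ch.isdigit():
        if pending in ("", "NUM*"):
            return [], "NUM*"            # inside a (possibly growing) number
        return None
    if ch == ",":
        if pending == "NUM*":
            return ["NUM", "COMMA"], ""  # the number ends; comma is complete
        if pending == "":
            return ["COMMA"], ""
    return None

def terminal_paths(vocab):
    """token -> (terminals fully emitted, leftover lexer state)."""
    out = {}
    for tok in vocab:
        pending, emitted, ok = "", [], True
        for ch in tok:
            step = lex_step(pending, ch)
            if step is None:
                ok = False
                break
            new, pending = step
            emitted += new
        if ok:
            out[tok] = (tuple(emitted), pending)
    return out
```

Note how "12" emits no terminal at all: the NUM is still extendable when the token ends, so only the leftover lexer state is recorded. That leftover state is what the next token's simulation must start from.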


2) Terminal → stack behavior (Template automata)

Now assume a grammar terminal t has been recognized.

What happens to the LR parser stack?

In LR parsing, on lookahead terminal t, the parser consults the ACTION table at the current top-of-stack state:

  - shift: push a new state and consume t,
  - reduce: pop one state per right-hand-side symbol, then push the GOTO state for the production’s left-hand side, or
  - error: this terminal is impossible here.

So before you can shift t, you may need a chain of reductions. That chain depends on deeper stack elements because reductions pop.

Key observation

Even though the LR stack can be arbitrarily deep, the behavior “what reductions happen before shifting t?” depends only on some suffix of the stack.

You can capture all possible reduction chains for terminal t as a finite graph that:

  - reads LR state IDs off the stack, top to bottom,
  - follows every reduce/GOTO chain the ACTION table allows, and
  - ends at a state that shifts t.

This is basically a finite automaton with a “stack effect” register.

Modeling stack effects cleanly: push/pop words and cancellation

It’s convenient to represent stack operations as a word over symbols:

Concatenating two terminal effects concatenates these words.

Then you apply the obvious cancellation rule:

Algebraically this is the polycyclic monoid; practically it’s “well-typed push/pop cancellation.”

Why do we care? Because LLM tokens often correspond to multiple terminals; composing per-terminal stack effects and canceling adjacent push/pop pairs massively simplifies what you need to inspect on the original stack.
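A sketch of the cancellation rule on explicit push/pop words (this list-of-pairs representation is an assumption of the sketch, not the post’s actual encoding):

```python
# Sketch: stack effects as push/pop words with polycyclic-monoid-style
# cancellation. An effect is a list of ("pop", X) / ("push", X) pairs,
# applied left to right.

def compose(e1, e2):
    """Concatenate two stack effects and cancel adjacent push/pop pairs.
    Returns None when the composition is impossible (push X meets pop Y)."""
    out = list(e1)
    for op in e2:
        if out and out[-1][0] == "push" and op[0] == "pop":
            if out[-1][1] == op[1]:
                out.pop()          # push X then pop X: cancels to nothing
                continue
            return None            # push X then pop Y != X: dead branch
        out.append(op)
    return out
```

Composing “pop A, push B” with “pop B, push C” collapses to “pop A, push C”: the intermediate B never needs to be inspected on the real stack, which is exactly the simplification multi-terminal tokens benefit from.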

Template automaton (“Template VA”)

For each terminal t, we build a small NFA-like object whose accepting paths correspond to:

  - a possible reduction chain, expressed as a push/pop word over LR state IDs, followed by
  - the shift of t.

This automaton is derived directly from the LR parse table: edges correspond to “this state shifts t” or “this state reduces by production p” and so on.

Determinize it so that later composition is manageable.


3) Compose everything into the Parser DWA (stack → token mask)

Now we have:

To get “stack → vocab mask”:

  1. Take the Terminal DWA.
  2. Replace each terminal edge “t” with the Template automaton for t.
  3. Propagate the vocab-token bitset weight onto the inserted structure.
  4. When a single vocab token corresponds to multiple terminals, you’re effectively concatenating multiple templates—so apply push/pop cancellation.
  5. Determinize the result to get a final automaton that:
     - reads LR state IDs off the stack, top to bottom, and
     - outputs a bitset over V: the LLM tokens valid for that stack and lexer state.

This final structure is what I call the Parser DWA.

Why pushes can be dropped (for masking)

For get_mask, we only care whether a token is valid, not exactly what new stack would result.

The actual stack update is handled by commit(token) using the real lexer+parser.

So during compilation, after composing and canceling, any remaining “push” symbols that represent “the parser would push state X” can be discarded from the mask automaton. The runtime commit will do the true stack manipulation.

This is also what lets you do useful things like: start from an arbitrary prompt prefix (maybe partially through a JSON value) and commit raw bytes without needing to align to LLM token boundaries.


Bounded lookback (why we don’t read the whole stack)

There are two complementary views here:

View 1: reduction chains must be bounded (after normalization)

A parser can, in principle, take arbitrarily long chains of reductions without consuming input if the grammar has certain recursive patterns (right recursion / hidden left recursion). There’s classic work (Aycock & Horspool) characterizing when reduction chains can be unbounded and how to transform grammars to avoid it.

After the usual normalizations, the number of reductions you need to consider before a shift becomes bounded by a constant (for that grammar). That implies: deciding token validity only requires looking at a bounded suffix of the stack.

View 2: viable prefixes are regular

Classical LR theory says viable prefixes form a regular language and are recognized by a finite automaton (the LR(0) automaton). Our Parser DWA is essentially a weighted extension of that automaton: instead of merely accepting/rejecting a stack configuration, it computes “which vocab tokens extend it.”

Regularity is the deep reason “finite automaton reading stack suffix” is even possible.


Tokenization ambiguity and longest-match (streaming lexing is forward-looking)

There’s one more annoying real-world detail: the grammar lexer is streaming, and “token boundaries” don’t align with LLM tokens.

Even with longest-match lexing (“maximal munch”), you can’t always commit a terminal the moment it matches, because it might still be extendable.

Example:

  After the bytes 123 arrive, the terminal NUMBER already matches, but the next byte might be another digit, so the lexer can’t commit NUMBER yet.

In a traditional lexer, you solve this with lookahead on the raw input stream. In incremental decoding, the raw input arrives token-by-token, so you need a streaming strategy.

Inhibited terminals (one workable strategy)

When a terminal t matches at some point but is still extendable, fork the lexer/parser state:

  - one branch emits t now and feeds it to the parser, while
  - the other marks t as inhibited and keeps lexing toward a longer match.

If later bytes allow a longer match, prune the premature branch.

This creates multiple active lexer/parser states—exactly what GLR + GSS machinery is good at representing compactly.
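A toy sketch of the forking behavior with two terminals, EQ = "=" and EQEQ = "=="; pruning is deliberately left out, so both lexings of "==" survive to make the fork visible:

```python
# Sketch: streaming maximal-munch with forking ("inhibited terminals").
# When "=" matches but could still extend to "==", we keep both the
# emit-now branch and the keep-extending branch.

TERMINALS = {"=": "EQ", "==": "EQEQ"}

def is_strict_prefix(s):
    """Could s still grow into a longer terminal?"""
    return any(t != s and t.startswith(s) for t in TERMINALS)

def lex_branches(data):
    branches = {((), "")}             # (terminals emitted so far, pending bytes)
    for ch in data:
        nxt = set()
        for emitted, pend in branches:
            cand = pend + ch
            if is_strict_prefix(cand):
                nxt.add((emitted, cand))             # might become a longer terminal
            if cand in TERMINALS:
                nxt.add((emitted + (TERMINALS[cand],), ""))  # emit now (inhibited)
        branches = nxt
    # only branches with no half-read lexeme are complete lexings
    return {emitted for emitted, pend in branches if pend == ""}
```

On "==" this produces both ("EQEQ",) and ("EQ", "EQ"); under longest-match the second is the premature branch that pruning would discard once the longer match completes.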

Masking with ambiguity

If you have multiple active parse configurations (due to lexer ambiguity, inhibited terminals, or LR conflicts), then:

an LLM token is allowed if it is allowed in any active configuration.

So get_mask unions the bitsets across branches.

That “union across alternatives” is exactly why the weighted automaton story uses union as the merge operator.


Putting it together: fast get_mask, honest commit

At runtime you do two very different kinds of work:

commit(token) (parser work)

This is “real parsing.” It can be complicated and it inherits the usual GLR theoretical caveats.

get_mask(state) (automaton work)

This is a bounded walk over the stack suffix: table lookups and bitset AND/ORs, nothing else.

This is the crucial separation: mask generation never simulates reduction chains at runtime. That work has been compiled away into the automaton.


Why this feels close to optimal (without claiming a theorem)

I’m not going to claim a formal optimality theorem here; parsing theory is too spiky for that.

But the division of labor feels “asymptotically right”:

  - commit pays the real parsing cost exactly once per sampled token, while
  - get_mask pays a bounded number of table lookups plus O(|V|) bitset work.

Also: you can’t avoid touching |V| bits if the output is literally a |V|-bit mask. The goal is to make that bit touching occur in a small, fixed number of AND/ORs, not inside a giant vocabulary loop with parsing simulation.


Fast run, slow compile

Everything above punts on the hardest engineering fact:

building these automata can blow up if you’re careless.

You’re doing determinization and minimization in a weighted / semiring-ish setting, and real grammars are large.

Most of my time went into making compilation time and memory sane.

That’s a whole separate writeup.


Summary

If you only remember one idea from this post, it’s this:

Token masks can be read off the parser stack by a precomputed weighted automaton (intersect bitset weights along the stack, union across branches) instead of simulating the lexer and parser for every candidate token.

The result is a get_mask whose runtime behavior is predictable enough to care about in p99.9 land, while commit remains a standard incremental GLR parser update.

If/when I write the follow-up, it’ll be about the compile-time tricks: building the Terminal DWA, building the per-terminal templates, composing/canceling deterministically, and keeping the result small.